Read in the data set.
PROdata <- read.csv(dfile("PROdata.csv"))
head(PROdata)
Let’s examine the demographic data to make sure there is nothing unusual
ggplot(PROdata) +
geom_bar(aes(x=SEX, fill=TRT))
There are more males than females. There do not seem to be any data errors, except for some missing values.
Now let’s examine the ages of these patients for possible data errors.
library(plotly)
## We want to match the colors in plotly with the colors in ggplot
library(scales)
hex <- hue_pal()(2)
plot_ly() %>%
add_trace(data=PROdata,
y = ~AGE,
color = ~TRT,
colors=hex,
type="box",
boxpoints="all",
jitter=0.7,
marker=list(size=5))
There do not appear to be any data errors in the age demographic. Let’s exmaine the Patient Reported Outcome questionnaire data.
We transform the data so that it is vertical regarding the questionnaires. This will allow us to trellis the plots.
PROdata.vert <- PROdata %>%
tidyr::pivot_longer(
cols=starts_with("Q"),
names_to = "Q_no",
values_to = "result",
names_prefix = "Q",
values_drop_na=FALSE
)
Let us compare the sites regarding the response to the questionnaires for the active and control groups.
ggplot(PROdata.vert) +
geom_histogram(aes(x = result, fill=TRT),
binwidth=4,
alpha=.5,
position="identity") +
labs(title = "Questionnaire Data",
x = element_blank(),
y= element_blank()) +
facet_grid(SITE~Q_no) +
theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank()) +
scale_y_continuous(breaks=NULL) + scale_x_continuous(breaks=NULL)
## Warning: Removed 23 rows containing non-finite values (`stat_bin()`).
It appears that Site 4 is an outlier. This is probably the site where the scores were reversed. Let’s do some data munging to correct this data error.
#========#
# SWITCH #
# SCORES #
#========#
flip.scale <- function(x){
## Simulate getting the scale wrong on a 5 point likert scale
## re-assign their values 1->5, 5->1, 2->4, 4->2, 3 <-> 3
case_when(
x == 1 ~ 5,
x == 5 ~ 1,
x == 2 ~ 4,
x == 4 ~ 2,
x == 3 ~ 3
)
}
## apply the function
site4.flipped.scores <- flip.scale(PROdata.vert$result[PROdata.vert$SITE=="SITE04"])
tmp <- PROdata.vert
tmp$result[tmp$SITE=="SITE04"] <- site4.flipped.scores
## verify results
ggplot(tmp) +
geom_histogram(aes(x = result, fill=TRT),
binwidth=4,
alpha=.5,
position="identity") +
labs(title = "Questionnaire Data",
x = element_blank(),
y= element_blank()) +
facet_grid(SITE~Q_no) +
theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank()) +
scale_y_continuous(breaks=NULL) + scale_x_continuous(breaks=NULL)
## Warning: Removed 23 rows containing non-finite values (`stat_bin()`).
The data looks correct now. We will use this transformed data to create a plot of the scores, comparing the 2 treatments across all sites.
PROdata.vert.corrected <- tmp
write.csv(PROdata.vert.corrected, dfile("PROdata.vert.corrected.csv"))
## Summarize the data
PROdata.sum <- PROdata.vert.corrected %>%
group_by(TRT, Q_no) %>%
count(result) %>%
mutate(percent = n /sum(n))
## Save the data for use in RAWgraphics
write.csv(PROdata.sum, dfile("PROdata.sum.csv"))
## Plot the data
p <- ggplot(PROdata.sum, aes(x=result, y=percent, fill=TRT))
p <- p + geom_bar(position='dodge',stat='identity')
## label the facets
Q.labs <- c("1" = "PRO Question 1", "2" = "PRO Question 2", "3" = "PRO Question 3", "4" = "PRO Question 4" )
p <- p + facet_wrap(vars(Q_no),
ncol=2,
labeller = labeller(Q_no=Q.labs))
## add title
p <- p + ggtitle("Active group consistently scores higher than Control group\n on all PRO questions") +
theme(plot.title=element_text(hjust=0.5))
p + scale_y_continuous(labels = scales:: percent_format(accuracy = 1))